Lokatt: a hybrid DNA nanopore basecaller with an explicit duration hidden Markov model and a residual LSTM network

Background Basecalling long DNA sequences is a crucial step in nanopore-based DNA sequencing protocols. In recent years, the CTC-RNN model has become the leading basecalling model, supplanting preceding hidden Markov models (HMMs) that relied on pre-segmenting ion current measurements. However, the CTC-RNN model operates independently of prior biological and physical insights. Results We present a novel basecaller named Lokatt: explicit duration Markov model and residual-LSTM network. It leverages an explicit duration HMM (EDHMM) designed to model the nanopore sequencing processes. Trained on a newly generated library with methylation-free Ecoli samples and MinION R9.4.1 chemistry, the Lokatt basecaller achieves basecalling performances with a median single read identity score of 0.930, a genome coverage ratio of 99.750%, on par with existing state-of-the-art structure when trained on the same datasets. Conclusion Our research underlines the potential of incorporating prior knowledge into the basecalling processes, particularly through integrating HMMs and recurrent neural networks. The Lokatt basecaller showcases the efficacy of a hybrid approach, emphasizing its capacity to achieve high-quality basecalling performance while accommodating the nuances of nanopore sequencing. These outcomes pave the way for advanced basecalling methodologies, with potential implications for enhancing the accuracy and efficiency of nanopore-based DNA sequencing protocols. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05580-x.


Introduction
The concept of nanopore-sequencing was first drafted in 1989 as a hand-sketched illustration by David Deamer on a page of a notebook [1].30 years later, the technology is now commercially available from Oxford Nanopore Technologies (ONT) [2].Nanopore sequencing works by threading a single-stranded DNA (ssDNA) molecule through a protein-formed pore in a membrane, where the sequence of nucleotides along the ssDNA can be indirectly recorded through their effect on an ion current flowing through the pore.However, transforming the current measurements into a sequence of nucleotide bases, i.e., basecalling, is challenging for several reasons [3].Firstly, multiple nucleotides along the ssDNA, also known as a k-mer where k is the number of nucleotides, simultaneously affect the noisy current measurements at any given time.Secondly, the ssDNA's translocation speed through the pore is fast and unstable, leading to a random and apriori unknown number of current samples per nucleotide base in the measurements.Further, the short-term average translocation speed and the noise level are also variable across a long sequencing run due to measurement induced changes in the experimental conditions.These challenges made early nanopore sequencing unusable for most clinical and research applications.Although these problems have now been mitigated by, e.g., selecting and modifying various protein pores with narrower constriction, using improved DNA ratcheting enzymes to control the ssDNA's translocation speed, and better basecalling algorithms, the technology is still limited for many applications due to high error rates, amounts of material, and costs [4][5][6].
Early basecalling algorithms worked by first segmenting the current measurements into a sequence of probabilistic events [7].These events were then treated as observations in a graphical model, usually as a Hidden Markov model (HMM) with latent states representing the dominating k-mers [1,8].The representation of observation probabilities in HMMs went from simple Gaussian distributions parameterized by the mean and variance of the ionic current during an event [9,10] to more elaborate models such as hierarchical Dirichlet processes [11].With the probabilistic model in place, the final basecalling step could be completed using standard inference algorithms for HMMs, such as the Viterbi and the beam search (BS) algorithms.However, the performance of the HMM remained severely limited by the quality of the segmentation step and the choice of features used to model the distribution of the events.
To avoid the limitations of the HMM basecallers, most modern basecallers are based on end-to-end deep neural networks (DNN), following the pioneering work that led to the Chiron basecaller [12].Specifically, Chiron applied a recurrent neural network (RNN) to process the current measurements and a connectionist temporal classification (CTC) layer from natural language processing (NLP) to replace the event-based HMM for data alignment.The resulting network was then trained in an end-to-end manner.The previously used event features have thus been replaced by neural networks, which obviate the need for explicit feature engineering, and the HMMs were replaced by the much simpler CTC structure.The ONT open-source software Bonito, which also uses a CTC-RNN structure and end-to-end training, performs on par with the current stateof-the-art commercial software Guppy, which presumably also uses a deep learning solution.
Inspired by these pioneering projects, recent research into basecallers has mainly explored variations in the neural network structure.Networks with recurrent units such as long-short term memory (LSTM) networks and RNNs, temporal convolution networks (TCNs), and attention/transformer networks such as the convolution-augmented transformer have all showed acceptable basecalling accuracy when combined with a CTC layer [13][14][15].Another popular approach is to build the basecaller with the attention structure.However, basecallers built solely with attention have not yet demonstrated higher accuracy for basecalling [16].This may be because nanopore signals have more vague transitions between nucleotides than words in NLP tasks, especially within homopolymers or repeated sequences of purines or pyrimidines.
This said, the lack of linguistic rules for DNA sequences does not mean there is no prior knowledge about the process that generated the nanopore data.For example, some studies model the ratcheting enzyme, e.g., a helicase, as finite state space machines with well-defined state transition probabilities driven by ATP concentration [17].However, it is not straightforward to incorporate such knowledge in basecallers solely built with deep learning, although we believe that incorporating such prior knowledge could provide a pledge of in-depth understanding of nanopore sequencing and a new direction for future developments.
With this in mind, we wanted to revisit HMMs for basecalling while explicitly addressing some of the shortcomings of prior HMM basecallers.To this end, we propose a new hybrid basecaller called Lokatt that uses an explicit duration HMM (EDHMM) model with an additional duration state that models the dwell time of the dominating k-mer.The duration state allows the basecaller to be applied directly to the raw current measurements by circumventing the need to pre-segment the data, which was problematic in previous HMM basecallers.It also allows us to explicitly model the probability distribution of k-mer dwell times within the pore, e.g., based on a physical understanding of the ratcheting enzyme.However, we still use a neural network to estimate the individual kmer probabilities, and we train our basecaller using end-to-end training.
We trained and evaluated the Lokatt model on a methylation-free Ecoli dataset acquired locally with an ONT MinION device and Guppy.In order to establish a comparative baseline with the state-of-the-art architecture, we also trained the Bonito model from scratch using the identical training dataset.Additionally, to assess the models' generalization capability, we extended our evaluation by employing a dataset from [18] that consists of ten different bacteria.Our benchmarking provides a proof of concept indicating that such hybrid models can exhibit comparable performance to the Bonito model in raw read accuracy and consensus assembly quality when training on the same dataset while opening up possibilities for engineered structures.

The Lokatt model
An HMM is a generative Bayesian model used to represent the relationship between a sequence of latent states, which are assumed to be generated according to a Markov process, and a corresponding sequence of observations.For the basecalling task at hand we focus on building a hierarchical HMM structure to encode the temporal dependencies between a k-mer sequence of length M, denoted by K {K 1 , K 2 , . . ., K M } where K m ∈ K {1, . . ., 4 k } , and a current measurement sequence of length N, denoted by {X 1 , X 2 , . . ., X N } , where typically M ≪ N.
The duration of any state with a self-transition in a Bayesian state-space model is always geometrically distributed.This is inconsistent with the dwell times reported for both polymers and helicases [19,20], two popular candidates for ratcheting enzymes.This inconsistency causes basecalling errors in regions rich in homopolymers since current variations are relatively small here, implying that the basecaller can only rely on the statistical modeling of the translocation speed.We address this problem via the introduction of a sequence of M explicit duration variables [21], denoted by D {D 1 , D 2 , . . ., D M } where D m ∈ N .This provides flexibility in terms of assigning an arbitrary dwell-time distribution.Each pair (K m , D m ) for m = 1, . . ., M thus encodes the mth k-mer and its dwell time.However, to encode potentially very long dwell times using a limited number of states, we will also allow self-transitions between the states encoding the distribution of D m for m = 1, . . ., M.
In the following two subsections, we formalize the hybrid data model on which Lokatt is based, beginning with the EDHMM and continuing with the DNN observation model.

The EDHMM structure
We consider a hierarchical and generative Bayesian EDHMM for the nanopore data, constructed as follows.We first draw the number of k-mers, M ∈ N , from some distri- bution ζ(M) .Then, we draw the first-order Markov sequence of k-mers K starting with K 1 from a distribution ξ 0 (K 1 ) followed by recursive draws of K m from K m−1 according to a conditional distribution ξ(K m |K m−1 ) for m = 2, . . ., M .At the same time, we draw the sequence of dwell-times D by drawing each D m independently from some distribution η(D) for m = 1, . . ., M .Finally, D m measurements X (m,d) for d = 1, . . ., D m are drawn for each k-mer K m for m = 1, . . ., M , from a conditional distribution ϕ(X (m,d) |K m ) .Here the state variable d acts as a counter that counts down to the next draw of a k-mer.
For notational convenience, we let X denote the sequence of measurements in the order of which they would be obtained, i.e., and use simply X n for n = 1, . . ., N , where N = M m=1 D m , when speaking of the nth consecutive measurement.The joint probability distribution of the model is given in Eq. (1).
To allow for computationally efficient inference using the model, we make two additional assumptions on the model: First, we assume that the sequence length is geometrically distributed such that ζ(M) = (1 − α)α M for some α ∈ [0, 1) ; Second, we assume the duration distribution has a geometric tail such that η(D + 1) = γ η(D) for all D ≥ D for some γ ∈ [0, 1) and some maximum explicit duration constant D ∈ N .The value of these assumptions is that they are inherently encoded in a finite state-space model using self-transitions between states.The generative model can, with these assumptions, be encoded as a state model with a start state A, an end state B, and |K| × D intermediate state pairs (K, d) representing joint k-mer and duration states.The intermediate states are stochastically reset according to the explicit duration probability η upon drawing new k-mers.The set of paths from A to B is isomorphic with the set of pairs (K , D) , and a pair can be drawn from p(K , D) by a random walk from A to B with the following state transition rules: (1) The state-space model is a standard implementation of an EDHMM [21].The mth time the process returns to a state of the form (K m , 1) for K m ∈ K marks the end of k- mer K m in the data.Each measurement, X n = X (m,d) , can also be drawn directly based on the value of state (K m , d) , although we, in our particular implementation, assume that the observations are independent of d.
The value of the state-space representation is that it allows for efficient inference while maintaining model interpretability [22].From the specific model presented above, the EDHMM can be constructed into a graph with size (M × D) × N , as illus- trated in Fig. 1.The joint data likelihood and k-mer sequence p(K, X) can, using this graph, be efficiently computed with Eq. (1) by applying the forward algorithm.The data likelihood p(X) can, similarly, be efficiently computed by applying the for- ward algorithm to a slightly altered graph, where K m ∈ K can be arbitrary.The two graphs are sometimes referred to as the clamped and free-running models, respectively [23,24], and allow us to efficiently compute the posterior sequence likelihood as p(K|X) = p(K, X)/p(X) .The conditionally most likely sequence of states and dwell-times (K , D) can be computed using the Viterbi algorithm, and the condition- ally most likely sequence of states K can be approximately obtained using any of our recently introduced greedy marginalized BS (GMBS) algorithms [25].
Choosing the exact distributions in the model are also rather straightforward.The kmer transition probabilities ξ(K m |K m−1 ) can, for example, be estimated with a maximum likelihood (frequency counting) estimator from a reference genome; or by using uniform probabilities for k-mers K m that originate from K m−1 , e.g., set to 1  4 for each possible one-base transition and zero for any other combination, in order not to bias the model towards any particular genome.The value of α needs to be set very close to 1 for the model to plausibly generate reads on the order of hundreds of thousands of bases, and for all intents and purposes, one can set α = 1 in the implementation of the inference algorithms which we also do, although this does, strictly speaking, lead to an ill-defined prior distribution.The dwell-time distribution η(D m ) can be estimated from sequences realigned at the sample-to-k-mer level.In the particular realization of Lokatt studied herein, we use a log-logistic distribution for η with parameters estimated using linear regression from the average number of large changes in the signal amplitude per sample.At the same time, we choose the tail factor γ so that η(D) provides a representative mean dwell time.We also perform a read-specific duration estimation during training and inference in our implementation since the average duration is longer for reads obtained later during the sequencing run due to ATP depletion.Finally, we replaced ϕ(X (m,d) |K m ) with scores obtained from a match network ϕ θ n,K (X) , where K ∈ K , n = 1, . . ., N and where θ are the trainable weights of the (match) DNN described next.

The DNN structure
Lokatt relies on a neural network to dynamically extract the features from the current samples, then map them into scores associated with each k-mer, i.e., ϕ θ n,K (X) .As we want the basecaller to be capable of handling various lengths of inputs, we construct a network with convolution kernels and recurrent units.In particular, we use two types of blocks: i) two residual-convolution blocks consisting of three convolution layers, with 32, 64, 32 features respectively, and a skip connection [26]; and ii) two bi-directional LSTM blocks [27], with 256 features in the first block and 1024 for the second block on each direction.In the end, a dense layer is applied to map the 2 × 1024 = 2048 features into |K| = 4 k output dimensions, which in Lokatt is 1024 with k = 5 .Lokatt contains two of each block type, resulting in a total of 15.3 million weights.We used layer-wise normalization and Swish activation between each layer in the network, which nonlinearly interpolates between a linear function and the ReLU function [28,29].The complete network is shown in Fig. 2A with a decomposition of the residual-convolution block in Fig. 2B and the bi-directional LSTM block in Fig. 2C.

Data generation
To evaluate Lokatt, we performed ONT MinION sequencing on non-methylated E.coli genomic DNA (D5016, Zymo Research) in two repeated runs.The choice of this DNA type stemmed from an initial assumption regarding the potential significance of DNA methylation in basecalling.The sequencing libraries were prepared by fragmenting the genomic DNA using Covaris g-TUBE and a Ligation sequencing kit (SQK-LSK109, Oxford Nanopore).The first sequencing run, conducted in December 2019, used Flow Cell chemistry R9.4.1 and was basecalled with Guppy 3.2.2.The second sequencing run took place in November 2021 with Flow Cell chemistry R9.4.1 and Guppy 5.0, aimed at obtaining a recent comparison with state-of-the-art software and generating distinct datasets for training and evaluation purposes.
Furthermore, short-read Illumina sequencing was performed using TruSeq PCR-free library preparation on the MiSeq sequencing platform (Illumina, USA).A draft assembly was constructed from the Illumina data using SPAdes v3.6.0 [30], serving as the reference genome.The ground-truth nucleotide sequence for each raw read was obtained by mapping its tentative sequence, provided by Guppy, to the reference genome using the aligner program Minimap2 [31].
The data were divided into batches based on their position on the genome.We divided the data from the 2019 sequencing run into batch 1.A and batch 1.B, by separating reads corresponding to the first and second half of the reference genome, respectively.The 2021 sequencing run was similarly divided into batch 2.A and batch 2.B.The data in batch 1.A was used as training data for the model, while the remaining batches were used for validation and performance evaluations.This allowed us to ensure that the model was not over-fitted to the underlying genome by comparing the performance on batch 2.A, where an obvious homology exists, with the performance on batch 1.B and 2.B where no clear homology between the training and test data should exist.Similar data division strategies have been used for human genome data sets, where the training and testing were based on different chromosomes [16].
Fig. 2 The overall structure of the Basecaller: A The main components of Lokatt, from the bottom up, are normalized input, the neural network, the EDHMM layer, and output.The neural network is expanded into main component layer blocks on the right side: two residual blocks, two bi-directional LSTM clocks, and a dense layer.B The inner structure of the residual block consists of three layers of 1D convolution, followed by layer-normalization, and the Swish activation, of which the outputs are taken and added with the inputs of the residual block followed by cross-layer normalization.C The inner structure of the bi-directional LSTM layer consists of two independent LSTM layers, one in the forward direction and the other in the backward direction.The outputs of the two LSTM layers are then concatenated along the feature dimension, making the output sequence the same length as the inputs We have also randomly selected 8000 reads from a data batch 1.A to generate the pretraining data set.For each read and its corresponding nucleotide sequence, we applied the Viterbi algorithm on the clamped EDHMM to find the most likely k-mer and dwelltime sequence, establishing a sample-to-k-mer level alignment.Here, the EDHMM used for such alignment assumes the Gaussian observation probabilities and is trained using the Baum-Welch algorithm on data batch 1.A.
For further assessment of the decoding efficacy and model generalization capabilities, we employed the test dataset established in [18], referred to as data batch 3.This dataset encompasses a diverse spectrum of microbial species, including four distinct variants of Klebsiella pneumoniae, along with six other bacterial entities, namely Acinetobacter pittii, Haemophilus haemolyticus, Serratia marcescens, Shigella sonnei, Staphylococcus aureus and Stenotrophomonas maltophilia.Notably, these samples comprise methylated sequences and underwent sequencing utilizing both R9.4 and R9.4.1 chemistries during the temporal interval spanning the years 2017 to 2018.

End-to-end training
The list of parameters contained in the Lokatt models includes the k-mer transition probability ξ , dwell-time distribution η , tail factor γ and the weights of the DNN θ .As described in Sect.2.1, all parameters were directly obtained by performing maximum likelihood estimation with the training data, except for the weights in Lokatt's DNN, which were trained end-to-end with (semi)-supervised learning using a modified conditional maximum likelihood (CML) approach.
The vanilla CML objective maximizes the conditional log-likelihood, i.e. log p(K|X) , which is often decomposed as log p(K|X) = log p(K, X) − log p(X) .The clamped state- space model is used to compute log p(K, X) and the free-running model is used to com- pute log p(X) , respectively, the former with latent states representing the particular target k-mer sequence and the latter with latent state from K representing all possible sequences.Through the two graphs, the gradients of p(K, X) and p(X) with respect to ϕ θ n,K (X) can be computed as follows [24] and where p(K n = K |X, K) and p(K n = K |X) represent the probability of a visit to k-mer state K at time n, given the observation samples X with its corresponding state sequence K or for all possible state sequences, respectively.The computation of ∂ϕ θ n,K (X)/∂θ can be achieved through error back-propagation within the DNN structure using the built-in functionality of standard software for deep learning.
During the training of Lokatt, we realized that this strategy led to a network that did not generalize well and gave low testing scores.In a sense, the model learned to maximize log p(K|X) by minimizing log p(X) and hence did not provide a very good model of the data.This leads the neural network to output low values of the score ϕ θ n,K (X) overall.Therefore, we instead adopted a modified CML training strategy, where we maximized a weighted linear combination of log p(K|X) and log p(X) to regularize the model to provide a balance between posterior decisions and modeling of the data.In particular, we explicitly used with regularization factor = 1 2 as the overall cost function.We also trained the network with a cost function where p(X) was replaced by p(K BS , X) , i.e, with again with = 1 2 , but where K BS is the output sequence of k-mers from the GMBS decoder presented in Sect.3.3 and discussed in detail in [25].The intuition behind Eq. ( 2) is to make the training focus more on segments of the data where K BS differ from the (correct) reference sequence K .The model trained with this approach showed higher final basecalling accuracy.Therefore, it was used to benchmark Lokatt with the other basecallers.
Training with the regularized CML loss from Eq. ( 2) requires two graphs representing p(K, X) and p(K BS , X) to compute the gradients with respect to the output of the DNN, i.e., ϕ θ n,K (X) , for n = 1, . . ., N .Note here that while p(X) is usually computed using the free running mode, we use the clamped model for both p(K, X) and p(K BS , X) .In our implementation, the size of the two graphs are D × M × 4096 and D × M BS × 4096 , where the lengths of K and K BS , i.e., M and M BS , are on the order of a few hundred bases of overlapping k-mers.To manage the complexity of the inference on these graphs, we rely on custom GPU implementations that can efficiently utilize the sparsity of the graphs.In particular, we implemented the gradient calculation for the EDHMM as custom CUDA kernels [32] and registered them as customized operations in Tensor-flow2 [33].The gradients were then back-propagated to the DNN to calculate weight updates using a standard batch-based Adam optimizer [34] from Tensorflow.We trained the whole DNN on the NSC Berzelius compute cluster using 8 Nvidia A100 GPUs, 1

Decoding with the GMBS
In the final stage of basecalling, often referred to as decoding, the objective is to find the sequence of k-mer with the highest posterior probability, i.e.K = arg max K p(K|X) .The optimal solution to this problem is intractable due to the exponential growth of the 1 Berzelius: https:// www.nsc.liu.se/ suppo rt/ syste ms/ berze lius-getti ng-start ed/ search space as read length increases.This necessitates an exhaustive search across all possible k-mer sequences with lengths M that not exceeds N.An alternative involves utilizing the Viterbi algorithm, noted for its computational efficiency, to approximate the optimal solution by identifying the jointly optimal sequence (K, D) that maximizes p(K, D|X) .Nonetheless, this approach lacks a solid theoretical guarantee regarding esti- mation quality.Moreover, the accurate decoding of the optimal k-mer sequence requires the marginalization of duration states on the EDHMM graph, a challenge for which the Viterbi algorithm lacks a dedicated strategy.
We instead rely on our newly proposed GMBS algorithm, which approximates the maximum a posteriori solution by recursively searching and maintaining a fixed-size list of k-mer sub-sequences K m (K 1 , . . ., K m ) , i.e., beams, with high posterior probability p(K m |X n ) for X n (X 1 , . . ., X n ) and m ≤ n .Each beam also keeps track of the probabil- ity of p(K m , D m |X n ) from which it can compute the probability of new beams and mar- ginalize over D m as needed.The empirical experience in [25] has shown that the GMBS algorithm performs better than the Viterbi algorithm on decoding the EDHMM graph when we use 512 beams, with a significantly lower memory footprint.The GMBS performances can be improved by increasing the number of beams at the cost of slower pruning operations due to the sorting complexity increase.Specifically, the parallel sorting algorithms implemented in the GMBS scale, for a beam list size of B, as O(log 2 B) [35].
In implementing the GMBS algorithm, we use a tree structure to store K m in mem- ory, where common ancestry represents common initial sub-strings of K m .Since this tree has at most N × B nodes, it can be efficiently implemented on the GPUs without needing dynamic memory allocation.Finally, to extract the final approximate maximum a posteriori k-mer sequence K = K M , we can read this tree backwards from the highest- scoring leaf node to the root in a fashion similar to the backtracking step of the Viterbi algorithm.A detailed explanation of the LFBS algorithm is provided in [25].
When basecalling with Lokatt, we divide each raw read into segments of length 4096 with an overlap of 296 measurements, and these segments were individually basecalled.The uniform length of input sequences facilitates efficient parallel implementations of Lokatt without harming the basecalling performance.The resulting output sequences were subsequently assembled by aligning the beginning and ending portions of consecutive pairs.

Benchmarking
We benchmark the Lokatt model with the Bonito model over all three data batches.Bonito [36] is an open-source research tool released by ONT that harnesses the state-ofthe-art CTC-RNN model.To investigate the impact of the model architecture, we independently trained a Bonito model, referred to as Bonito Local, from scratch with the same training data used in training the Lokatt basecaller.Additionally, we included the performance from the fine-tuned Bonito basecaller dna_r9.4.1_e8_sup@v3.3, denoted as 'Bonito Sup' , which yields to give 'super accuracy' and is presumably trained on a more extensive data set.For data batches 2.A and 2.B, we also incorporated results obtained from ONT's Guppy 5.0, which is ONT's commercial basecaller.The data batches 1.A and 1.B were sequenced with earlier version of Guppy 3.2.2,whose performance is 2% − 5% lower than Guppy 5.0 and therefore excluded in the benchmark as it no longer represents state-of-the-art.In the following sections, we will discuss the basecalling quality in terms of raw read accuracy and assembly accuracy.In particular, we will also report basecalling performances on homopolymer regions.The homopolymers denote sequences of consecutive identical bases, presenting a substantial challenge in the basecalling process due to their similar current modulation and variable duration for each identical base.

Raw read accuracy
To access the raw read accuracy, we measured the shortest Levenshtein distance between the basecalled sequences from Lokatt, Bonito Local and Bonito Sup and the reference genome and generated their pair-wise alignments.This entails determining the minimal number of single-state edits, such as insertions, deletions or substitutions, needed to convert the basecalled sequence into the Illumina-generated reference genome [37], which can be obtained using Minimap2 v.2.17-r941 with the default parameters.The alignments are then quantified and reported as the sequence accuracy metrics, including the identity, mismatch, insertion and substitution scores.Specifically, the identity score is formulated as the ratio of matched bases to the total alignment columns as follows: Similarly, the error scores, including mismatch, insertion and substitution, are calculated as the ratios of their individual count to the overall alignment columns.In addition, we also presented measurements of matched base length per read (read length), counts of matched read entries (entry counts) and the cumulative number of matched bases.It is important to note that both mean and median values were reported for accuracy metrics and read length.However, for our analysis, the focus was directed towards median values.This choice is attributed to the significant data variance, wherein reads with low accuracy are too noisy for Minimap2 to recognize [7,18].
The outcomes of the benchmarking experiments are presented in Tables 1 to 5, each highlighting the basecaller performances across different data batches.1, with error rates exhibiting a similar ±0.002 difference.This minimal variation between batch 1.A and 1.B suggests that overfitting of the models is unlikely.Notably, Lokatt continues to exhibit the longest read length, while Bonito Sup maintains the highest entry count.This also contributes to Lokatt reporting 3% fewer matched bases than Bonito Sup, but 3% more than Bonito Local.
Tables 3 and 4 display the testing result from data batch 2.A and 2.B, which were generated from the same E.coli samples used in data 1 but more recently.Compared to results from data batch 1.A and 1.B, all basecaller exhibit identity score 0.007-0.014higher and, at most, error score 0.008 lower.Regarding the read length, Bonito Local shows an increase of 5% , while both Lokatt and Bonito Sup demonstrate an increase of 14% and 13% , respectively.The substantial increase in entry counts and total matched bases is because data batch 2 sequenced 3-4 times more samples, which makes the comparison trivial.Nonetheless, when comparing among basecallers on   Additional results from Guppy 5.0, ONT's commercial software at the experiment's time, are also reported in Tables 3 and 4. Guppy's overall performance closely aligns with Lokatt's; Guppy exhibits a 0.003 higher identity score and basecalled only 0.25% more total bases.Table 5 presents the testing performance on data batch 3 with diverse species sequenced at an earlier period.Table 5 only contains identity score and total base counts due to space limit; however, a comprehensive performance table is available in Additional file 1: Table S1.Notably, all basecaller exhibit reduced identity scores for all species except Staphylococcus aureus, most recently sequenced among data 3 and utilizing the updated chemistry R9.4.1.Overall, Lokatt demonstrates an average identity score of 0.904, which declined by 0.029 on average and 0.049 on maximum, compared with data 2; Bonito Local shows an average identity score of 0.880, with a decline of 0.046 and maximum decline of 0.079; Bonito Sup has an average 0.947 with a decline of 0.019 on average and 0.031 in maximum.For total matched bases, Lokatt holds 6% more than Bonito Local but 7% fewer than Bonito Sup and 0.3% fewer than Guppy.
In addition to the read accuracy, we also assessed the performance of basecallers in handling homopolymer regions on data batch 2 and data batch 3 with respect to each species, as presented in Table 6.The performance evaluation on the homopolymer regions involved measuring the number of correctly basecalled homopolymers, the total count of homopolymers within the basecalled area, and their ratio represented as accuracy.The number of homopolymers varies vastly across different datasets due to variations in genomes and data coverage rates.Specifically, the E. coli dataset (data batch 2) stands out with nearly five times more homopolymers than the other datasets because of its higher coverage.Aggregating the homopolymer numbers across all species, Bonito Sup demonstrated the highest accuracy at 0.859, surpassing both Bonito Local with an accuracy of 0.768 and Lokatt with 0.748.However, Lokatt's output sequences covered 0.4% more homopolymers than Bonito Sup and  S2.

Basecaller complexity
Incorporating a DNN structure, both Lokatt and Bonito introduce computational demands that can potentially restrict their applicability.Bonito 0.5.0 utilizes the CTC-Conditional Random Field [38] top layer, integrated with convolution layers followed by LSTM layers in alternating forward and reverse directions, totaling 26.9 million parameters.In contrast, Lokatt adopts residual blocks comprising convolution layers, followed by bi-directional LSTM layers and an EDHMM layer, collectively having 15.3 million parameters.It is noteworthy that Bonito's employment of a stride of 5 in its convolution layer has effectively reduced the computationally expensive recurrent calculations within the LSTM layer by a factor of 5. Consequently, despite being nearly twice as large as Lokatt, Bonito (50k base pair per second) achieves approximately a 5-fold speed enhancement compared to Lokatt (8k base pair per second) when executed on the Nvidia V100 GPU.
The inference speed of Lokatt can be enhanced by improving either the efficiency of the DNN or the GMBS algorithm, as these two components collectively account for 98% of the current inference runtime.Optimizing the DNN can involve further reducing the model size, while the GMBS algorithm can benefit from code optimizations, such as improved memory management.Inference time can also be achieved by reducing the number of beams.However, it is important to note such optimizations may lead to a trade-off with overall performance.
In the current implementation, the DNN consumes 40% of the runtime, and the GMBS consumes 58% of the time.According to Amdahl's law, optimizing either of these steps in isolation can not lead to an improvement beyond a factor of 2 in inference speed.That being said, it is worth considering that the complexity of the DNN and the GMBS scales linearly with the length of the input.Emulating Bonito's approach of introducing a stride of 5 in the early convolutional layers would, therfore, also increase Lokatt's inference speed by a factor of 5.However, this approach might compromise the model's interpretability regarding the dwell-time distribution, which is why we have yet to pursue this further.

Consensus evaluation
The basecalled read sequences on data batch 2 from Lokatt, Bonito Local, Bonito Sup and Guppy 5.0 were assembled, respectively.De novo genome assemblies were generated using Flye [39], resulting in four distinct draft genome assemblies.The evaluation of these assemblies against the reference Illumina E.coli genome was executed with Quast [40].The assessment, as depicted in Fig. 3a, reveals a genome coverage of 99.750% for Lokatt, while the remaining three basecallers exhibit a slightly higher cover- age of 99.757% .The proportions of GC content remain comparably consistent among the four basecallers: Lokatt at 50.88% , Bonito Local at 50.84% , Bonito Sup at 50.82% , and Guppy 5.0 at 50.83% , in contrast to the reference genome's 50.77% , as illustrated in Fig. 3b.The same contig length distribution and contig connectivity are shared among all basecaller's assemblies, reflected in the NGAx plot in Fig. 3c.The NGA50 values, specified in Fig. 3d, exhibit marginal disparities: 106101 for Lokatt, 106277 for Bonito Local, 106360 for Bonito Sup, and 106293 for Guppy 5.0.Notably, Lokatt shows the least number of misassemblies at 202, compared to 204 for both Bonito basecallers and 203 from Guppy, as depicted in Fig. 3e.Furthermore, the assessment of mismatches per 100k base pairs, illustrated in Fig. 3f, highlights Lokatt's score of 8.22k, which is notably lower than Bonito Local's 9.9 and Bonito Sup's 9.69, and approaches the minimum value of 8.14 attained by Guppy.The complete assemblies report is available in Additional file 1: Table S3.

Discussion
The evaluation of basecallers, Lokatt, Bonito Local, Bonito Sup, and Guppy 5.0, on various datasets yielded insights into their performances and limitations.Analyzing raw read accuracy in E.coli data batch 1 and 2, Lokatt's performance emerged as competitive with the Bonito basecallers.Lokatt achieved a slightly higher median identity score than Bonito Local by 0.01, while falling slightly behind Bonito Sup by 0.033.The assessment of generalization capabilities on the extra data batch 3 showed a decrease in performance for both Lokatt and Bonito Local, with the latter experiencing a more notable decline.This phenomenon aligns with reported observations in cross-species benchmarks [41], where basecalling accuracy tends to decrease as the genomic distance between species increases.For genomes with substantial divergence, such as those of Homo sapiens and Lambda phage, this accuracy reduction can be significant.We attribute the observed decline on data batch 3 in Lokatt and Bonito Local performance to this cross-species effect, as they were exclusively trained on the first half of the E.coli genome.Bonito Sup exhibited superior accuracy, although limited information was available regarding its training data and methodology.Furthermore, the sequencing of data batch 3 using older chemistry compared to the training data of Lokatt and Bonito basecallers also contributed to the decrease in performance.On handling the homopolymers, Lokatt's output covered a larger number of homopolymer regions compared with both Bonito basecallers.However, this was associated with lower accuracy, especially for homopolymers of lengths exceeding 5. Notably, the study underscored the Lokatt model's capability to effectively handle small-genome species sequencing data at a level comparable to the Bonito model.
The assembly analysis introduced a noteworthy observation: the correlation between genome coverage and raw read accuracy across basecallers isn't always straightforward.Specifically, Lokatt, despite having cumulatively 7% more matched bases than Bonito Local and a close number to Guppy, exhibited marginally 0.007% lower genome coverage on the E.coli genome.However, Lokatt maintained the least misassemblies and relatively low mismatches per 100kbp.Bonito Local on the other hand, having the lowest identity score and fewest base counts, achieves the same genome coverage as Bonito Sup, which shares an identical model architecture with Bonito Local but with different weights.This discrepancy might arise from differences in error-handling strategies among basecallers, potentially indicating that Lokatt excels in specific genomic regions but faces challenges in others.Alternatively, this could be attributed to Lokatt's current lack of quality scores associated with basecalled nucleotides, which play a crucial role in the assembly process and subsequent analyses.Furthermore, it's important to note that the evaluation primarily focused on E.coli data, which may not fully capture the challenges posed by diverse species and complex genomic regions.
The study also highlighted the distinct design and training strategies of Lokatt.Bonito utilizes the CTC loss, where a blank state is manually inserted between every nucleotide.The CTC loss typically leads to a dominance of blank predictions in the output of the CTC-trained RNN [12], prompts presumably a less complicated task where the RNN has learned to detect the transition moment of the nucleotides and treat everything else as blank.However, it also potentially limits model flexibility.In contrast, Lokatt integrates an EDHMM modelling the dynamic of the ratcheting enzyme, and is tasked to learn the complete characteristics of the ion current measurements.We have separated the complicated task into the pre-trained stage and the subsequent end-to-end training.We observed that the pre-trained neural network could achieve a median basecalling accuracy of around 0.905 on data batches 1 and 2. The subsequent CML training further refined Lokatt's accuracy by around 0.02-0.03.In light of this, although HMMs are capable and comparable in various respects, achieving fully interpretable parametric models remains a challenge, leading to limitations in Lokatt's performance.

Conclusion
This research project has culminated in the development, training, and evaluation of Lokatt, a novel hybrid basecaller.The raw read performance of the Lokatt model was better than the CTC-CRF structured Bonito model if trained on the same dataset, and was comparable to ONT-trained Bonito and Guppy basecallers.Lokatt's evaluation metrics for consensus sequencing resembled those of the other basecallers.Notably, Lokatt demonstrated fewer misassemblies and mismatches per 100kbp, despite having a lower overall genome coverage ratio.This scenario highlights the complex trade-offs that basecallers need to make between accuracy, coverage, and the nature of the underlying genomic sequences.Different basecallers may excel in different contexts, and the choice of which to use depends on the specific goals of an analysis, such as accurate assembly of specific regions versus comprehensive genome coverage.Both Lokatt's and Bonito's architectures leverage DNN structures enhanced with dynamic models to bridge the gap between input current measurements and output bases and enable end-to-end training.However, Lokatt's unique integration of an HMM layer introduces a novel dimension of comprehension into the sequencing dynamics, thereby creating opportunities for future refinements.Future versions of Lokatt could potentially exploit a more sophisticated dynamic structure, informed by a deeper understanding of the sequencing device and chemistry process.This study contributes to the field by introducing an innovative basecaller model and insights into basecaller performance.

Fig. 3
Fig. 3 Assembly performances for basecallers: A The genome coverage ratios, computed as the ratio between the total number of aligned bases in the contigs and the reference length.B The GC content plot, showing the distribution of GC content in the contigs.C Cumulative length plot reflecting the length growth of the aligned contigs against the reference genome.D The shortest aligned contig for which longer and equal length contigs cover at least 50% of the assembly, i.e., NGA50 plot.E The misassemblies where the contig has a stretched aligned position on the reference genome longer than 1k bases.F The average number of mismatches per 100k aligned bases.(The detailed description of the terminologies used in the plots can be found in the QUAST manual [40])

Table 1
displays the training performance on data batch 1.A. Lokatt achieves a median identity score of 0.926, surpassing Bonito Local by 0.012 but falling 0.028 behind Bonito Sup, with corresponding lower error rates than Bonito Local and higher error rates than Bonito Sup.Regarding median read length, Lokatt records the longest matched length of 2807, compared to both Bonito models' 2731.However, for recognizable read entries based on Minimap2, Lokatt processes 565393 entries.This

Table 1
Training performances on data batch 1.A Lokatt demonstrates 3% fewer total matched bases than Bonito Sup, but 3% more than Bonito Local.Table2provides the test performance of data batch 1.B, which corresponds to reads aligned with the second half of the E.coli genome.The results indicate the median identity scores are within ±0.001 difference from the training scores shown in Table

Table 2
Testing performances on data batch 1.B *First five entries are listed as 'mean/median'

Table 3
Testing performances on data batch 2.A *First five entries are listed as 'mean/median'

Table 4
Testing performances on data batch 2.B Lokatt demonstrates identity scores 0.006 and 0.007 higher than Bonito Local data 2.A and 2.B respectively and 0.033 lower than Bonito Sup, while output 5% fewer matched bases than Bonito Sup with 7% more than Bonito Local in total.
*First five entries are listed as 'mean/median'

Table 5
Testing performances on data batch 3 *Identity scores are listed as 'mean/median' 7% more than Bonito Local.Yet, Lokatt exhibited 14.3% less correctly basecalled homopolymers than Bonito Sup but 6.3% more than Bonito Local.It's noteworthy that as the length of homopolymers increased, Lokatt's output consistently covered more homopolymers than Bonito Local and was comparable to Bonito Sup.However, Lokatt's accuracy experienced a noticeable decline in such instances.Detailed performance metrics for homopolymers of varying lengths in data batch 2 are provided in Additional file 1: Table

Table 6
Testing performances on homopolymers